Content - Calculating confidence intervals

Answers to exercises

Exercise 1

As all of these are Normal distributions, you can estimate the standard deviation as the distance from the mean to either of the inflexion points (where the probability density function changes from concave to convex). Of course, it is difficult to do this very precisely given the size and resolution of the graphs.
The standard deviation of the sample mean is $\dfrac{\sigma}{\sqrt{n}}$. We know that $\sigma = 7$, so from left to right the standard deviations are $\dfrac{7}{\sqrt{1}} = 7$, $\dfrac{7}{\sqrt{4}} = 3.5$, $\dfrac{7}{\sqrt{9}} \approx 2.3$, $\dfrac{7}{\sqrt{25}} = 1.4$.
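As a quick check of the arithmetic, the four values can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative, and the sample sizes 1, 4, 9 and 25 are taken from the calculation above.

```python
import math

sigma = 7.0                   # population standard deviation given in the exercise
sample_sizes = [1, 4, 9, 25]  # sample sizes of the four graphs, left to right

# Standard deviation of the sample mean: sigma / sqrt(n)
sd_of_mean = [sigma / math.sqrt(n) for n in sample_sizes]

for n, sd in zip(sample_sizes, sd_of_mean):
    print(f"n = {n:2d}: sd of sample mean = {sd:.2f}")
```

Quadrupling the sample size halves the standard deviation of the sample mean, which is why the graphs become narrower from left to right.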

Exercise 2

In general, for a random sample of size $n$ on $X$, we have $\mathrm{E}(\bar{X}) = \mu$ and $\mathrm{var}(\bar{X}) = \dfrac{\sigma^2}{n}$. If $X \stackrel{\mathrm{d}}{=} \exp(\tfrac{1}{7})$, we know that $\mu = \mathrm{E}(X) = 7$ and $\sigma^2 = \mathrm{var}(X) = 7^2 = 49$ (see the module Exponential and normal distributions). Hence:
1. $\mathrm{E}(\bar{X}) = 7$
2. $\mathrm{var}(\bar{X}) = \dfrac{7^2}{10} = 4.9$
3. $\mathrm{sd}(\bar{X}) = \sqrt{\mathrm{var}(\bar{X})} \approx 2.21$.
As the histogram shown in figure 17 is based on one million means of random samples of size $n=10$ from $\exp(\tfrac{1}{7})$, we expect the mean, standard deviation and variance of the histogram to be close to the values calculated above.
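The comparison between the histogram and the theoretical values can be explored with a small simulation. The sketch below is not the procedure used to produce figure 17: it assumes a much smaller number of samples (20,000 rather than one million) to keep the run short, and the seed and variable names are arbitrary choices.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable (an arbitrary choice)

rate = 1 / 7          # exp(1/7) has mean 7 and variance 49
n = 10                # size of each random sample
num_samples = 20_000  # assumption: far fewer than the one million in figure 17

# Draw many random samples of size n and record the mean of each one
sample_means = [
    statistics.fmean(random.expovariate(rate) for _ in range(n))
    for _ in range(num_samples)
]

sim_mean = statistics.fmean(sample_means)
sim_var = statistics.pvariance(sample_means)
sim_sd = sim_var ** 0.5

print(f"mean of sample means:     {sim_mean:.2f} (theory: 7)")
print(f"variance of sample means: {sim_var:.2f} (theory: 4.9)")
print(f"sd of sample means:       {sim_sd:.2f} (theory: about 2.21)")
```

The simulated values will not match the theoretical ones exactly, but they should be close, and closer still as the number of samples increases.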

Exercise 3

From the graph, we can see that the function does not take negative values; this is one property of a pdf. The second property is that the area under the curve is 1. This can be checked approximately by using the rectangles formed by the gridlines to estimate the area under the curve. Here is an attempt to guess what fraction of each rectangle is under the curve, starting with the rectangles in the bottom row (left to right), then the second row, and finally the small amount in the third row. In the units of the rectangles of the grid: \[ \text{Area} \approx (1 + 0.8 + 0.4 + 0.5 + 0.2) + (0.7 + 0.3) + (0.1) = 4.0. \] Each rectangle's area is $10 \times 0.025 = 0.25$. So, in fact, we have estimated the total area under the curve as 1, which is the exact value required for the function to be a probability density function. Of course, this is just an estimate, but it does demonstrate that the claim that the function is a pdf is plausible.
The mean of the corresponding random variable is 15.4. To guess the location of the mean, you need to imagine the region under the pdf as a thin plate of uniform material, placed on a see-saw corresponding to the $x$-axis. The mean is at the centre of gravity of the distribution, hence at the position required for a pivot that would make the distribution balance.
The standard deviation of the corresponding random variable is 12. This is harder to guess. For many distributions, including this one, about 95% of the distribution is within two standard deviations of the mean. On the lower side of the mean, all of the distribution is greater than $15.4 - 2 \times 12 = -8.6$. On the upper side, we have $15.4 + 2 \times 12 = 39.4 \approx 40$. How much of the area under the curve is greater than 40? We already estimated the area under the pdf between 40 and 50 as $0.2 \times 0.25 = 0.05$, leaving an estimated probability of 0.95 for the area under the pdf between 0 and 40. This informal evaluation is consistent with $\sigma = 12$, which is the correct value.
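The two area estimates used in this answer can be tallied in code. The sketch below simply repeats the arithmetic: the fractions are the eyeballed values quoted above, and it assumes (as the answer does) that the last rectangle in the bottom row covers the region between 40 and 50.

```python
cell_area = 10 * 0.025  # width x height of one grid rectangle

# Eyeballed fraction of each rectangle lying under the curve (from the answer)
bottom_row = [1, 0.8, 0.4, 0.5, 0.2]
second_row = [0.7, 0.3]
third_row = [0.1]

# Total area under the curve, converted from grid units to probability
cells_under_curve = sum(bottom_row) + sum(second_row) + sum(third_row)
total_area = cells_under_curve * cell_area

# Area beyond 40: the last bottom-row rectangle, about 0.2 of a cell
tail_area = bottom_row[-1] * cell_area
area_below_40 = total_area - tail_area

print(f"estimated total area:    {total_area:.2f}")
print(f"estimated area below 40: {area_below_40:.2f}")
```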

Exercise 4

A 0% confidence interval for $\mu$ is the point estimate $29.1$.
A 100% confidence interval for $\mu$ is certain to include $\mu$; it is $(-\infty, \infty)$. If the range of the random variable we are sampling from is restricted to $(a,b)$, then the 100% confidence interval for $\mu$ is $(a,b)$. This is always a useless interval: it tells us that the true mean is somewhere in the range of the variable, as it must be.

Exercise 5

True. The 95% confidence interval for this age group is $(9.52,10.38)$. The value 10 is in the confidence interval, so it is plausible that Australian children aged 12–14 use the internet for an average of 10 hours per week.
False. The 95% confidence interval is about plausible values for the true mean internet use in this age group, not about a range of values for the variable itself.
False. Again, the confidence interval is not about the range of potential values in the distribution of internet use. In fact, with a mean of 9.95 hours and a standard deviation of 7.81 hours, a value of 24 hours for some children in this age group is entirely plausible.

Exercise 6

The claim on the Venus bar wrapper is that the weight is 53 grams. If the claim is true, then the expected value of the sample mean from a sample of Venus bars would also be 53 grams, since $\mathrm{E}(\bar{X}) = \mu$. However, we know that the mean of a particular sample need not correspond exactly to this expectation. The average weight of Casey's 42 Venus bars is one gram heavier than the expected value (assuming the claim is true).
The approximate 95% confidence interval for the true mean weight of Venus bars, based on Casey's sample, is $54.0 \pm \bigl(1.96 \times \dfrac{0.98}{\sqrt{42}}\bigr)$. This is $54.0 \pm 0.30$, or $(53.7, 54.3)$.
The claim appears to be implausible, considering the confidence interval; the value of the claim is outside the 95% confidence interval. Of course, Casey may not mind: he is getting more chocolate than advertised, on average.
The method used for finding the confidence interval assumes that Casey's sample of Venus bars is a random sample from the population of Venus bars. We assume that the weights of the 42 Venus bars are independent; that is, the weight of a bar bought on one day is unrelated to that of a bar bought on another day. To assess the reasonableness of the assumptions, we need to know about the production of Venus bars and Casey's buying patterns. For example: Do errors in production occur in batches? Does Casey always buy from the same place?
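The interval calculated in this exercise can be reproduced with a short function. This is a minimal sketch: the function name `mean_confidence_interval` and its parameters are illustrative, and it uses the same large-sample Normal approximation with $z = 1.96$ as the answer.

```python
import math

def mean_confidence_interval(sample_mean, sd, n, z=1.96):
    """Approximate confidence interval for a population mean.

    Returns (lower, upper, margin), where margin = z * sd / sqrt(n).
    """
    margin = z * sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin

# Casey's sample: 42 bars, mean 54.0 g, standard deviation 0.98 g
lower, upper, margin = mean_confidence_interval(54.0, 0.98, 42)

claim = 53.0  # weight claimed on the wrapper, in grams
claim_inside = lower <= claim <= upper

print(f"margin of error: {margin:.2f}")
print(f"95% CI: ({lower:.1f}, {upper:.1f})")
print(f"claim inside interval: {claim_inside}")
```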

Exercise 7

The bounds for the 80% confidence interval will be closer to the point estimate than the bounds of the 95% confidence interval. Your estimate for the lower bound of the 80% confidence interval should be greater than $53.7$, and your estimate for the upper bound should be less than $54.3$.
The value of the factor $z$ from the standard Normal distribution for an 80% confidence interval is 1.282. The ratio of the values of $z$ for the 80% and 95% confidence intervals is $\dfrac{1.282}{1.96} \approx 0.65$. Hence, the margin of error for the 80% confidence interval will be about 0.65 times the margin of error for the 95% confidence interval. It will be about $0.65 \times 0.30 \approx 0.20$, making the 80% confidence interval about $(53.8, 54.2)$.
The approximate 80% confidence interval for the true mean weight is $(53.8, 54.2)$, to one decimal place.
As the 95% confidence interval was not consistent with the claim, we would not expect the narrower 80% confidence interval to be consistent with the claim, and it is not.
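The factor $z$ for any confidence level can be obtained from the inverse of the standard Normal cdf. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `z_for_confidence` is an illustrative choice, not part of the module.

```python
import math
from statistics import NormalDist

def z_for_confidence(level):
    """z such that the central `level` of the standard Normal lies in (-z, z)."""
    return NormalDist().inv_cdf(0.5 + level / 2)

z80 = z_for_confidence(0.80)
z95 = z_for_confidence(0.95)
ratio = z80 / z95  # how much narrower the 80% interval is than the 95% one

# Casey's sample: 42 bars, mean 54.0 g, standard deviation 0.98 g
standard_error = 0.98 / math.sqrt(42)
margin80 = z80 * standard_error
lower80, upper80 = 54.0 - margin80, 54.0 + margin80

print(f"z for 80%: {z80:.3f}, z for 95%: {z95:.3f}, ratio: {ratio:.2f}")
print(f"80% CI: ({lower80:.1f}, {upper80:.1f})")
```

The claimed weight of 53 grams lies below the lower bound of this interval, as it did for the 95% interval.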

Exercise 8

The distribution of weekly household expenditure on clothing and footwear is likely to be skewed with a long tail to the right, and the values for the mean and standard deviation are consistent with this.
If we wish to make an inference about the true mean expenditure, we need not be concerned about the shape of the distribution of weekly household expenditure on clothing and footwear, provided the sample size is large.
The sample size can be worked out because we know the standard deviation $s$ and the margin of error $E$ corresponding to a 95% confidence interval. The formula $E = 1.96 \times \dfrac{s}{\sqrt{n}}$ gives $2.9 = 1.96 \times \dfrac{145.8}{\sqrt{n}}$. Rearranging, $n = \Bigl(\dfrac{1.96 \times 145.8}{2.9}\Bigr)^2 \approx 9710$. Hence, the sample size is approximately 9710 households.
A 95% confidence interval for the true average weekly household expenditure on clothing and footwear is $44.50 ± $2.90, or ($41.60, $47.40).
An estimate of the average yearly expenditure can be obtained by multiplying the weekly estimate by 52.14; it is $2320. The approximate 95% confidence interval for the population mean yearly household expenditure on clothing and footwear is ($2169, $2471).
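The sample-size calculation and the conversion from weekly to yearly figures can be checked together. This is a sketch of the arithmetic only: the variable names are illustrative, and the factor 52.14 weeks per year is the one used in the answer.

```python
z = 1.96               # factor for a 95% confidence interval
mean_weekly = 44.50    # sample mean weekly expenditure, in dollars
sd_weekly = 145.8      # sample standard deviation, in dollars
margin_weekly = 2.90   # margin of error of the 95% interval, in dollars

# Rearranging E = z * s / sqrt(n) gives n = (z * s / E) ** 2
n = (z * sd_weekly / margin_weekly) ** 2

# Scale the weekly estimate and both interval bounds to a year
weeks_per_year = 52.14
yearly_mean = mean_weekly * weeks_per_year
yearly_lower = (mean_weekly - margin_weekly) * weeks_per_year
yearly_upper = (mean_weekly + margin_weekly) * weeks_per_year

print(f"approximate sample size: {n:.0f} households")
print(f"yearly estimate: ${yearly_mean:.0f}")
print(f"yearly 95% CI: (${yearly_lower:.0f}, ${yearly_upper:.0f})")
```

Multiplying the estimate and both bounds by the same constant is valid because the mean and the standard deviation both scale by that constant.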